1 Domain problem formulation

A central goal in precision medicine is to predict a patient’s response to therapeutic drugs given the patient’s unique molecular profile (Rubin 2015; Kohane 2015). Because many of these drugs are extremely costly, accurate response prediction is essential for both the health and the financial well-being of patients. In this work, we predict drug responses based on a combination of genomic, proteomic, and epigenomic profiles measured in human cancer cell lines from the Cancer Cell Line Encyclopedia (Barretina et al. 2012). Moving beyond prediction, however, a perhaps even more important goal is to identify the -omic features, in particular genes and proteins, that are important in the drug response prediction models. Identifying these -omic signatures is an initial step towards interpreting and understanding the features that drive the drug response prediction methods and may hint at candidates for future preclinical research.

In this PCS documentation for our stability-driven pipeline for drug response interpretable prediction (staDRIP), we will present our analysis from data exploration to modeling to post-hoc analysis, justifying our human judgment calls whenever possible. Through this, we aim to increase transparency and bridge the gap between the biological domain problem and the applied statistical methods.

2 Data

# load in ccle data sets
drug.resp.orig <- loadDrugResp()
mirna.orig <- loadMiRNA()
rnaseq.orig <- loadRNASeq()
methyl.orig <- loadMethyl()
prot.orig <- loadProtein()

# clean ccle data sets
drug.resp <- cleanDrugResp(data = drug.resp.orig)
mirna <- cleanMiRNA(data = mirna.orig)
rnaseq <- cleanRNASeq(data = rnaseq.orig)
methyl <- cleanMethyl(data = methyl.orig)
prot <- cleanProtein(data = prot.orig)

# convert rowname to column for merging
drug.resp.id <- drug.resp %>% rownames_to_column(var = "ID")
mirna.id <- mirna %>% rownames_to_column(var = "ID")
rnaseq.id <- rnaseq %>% rownames_to_column(var = "ID")
methyl.id <- methyl %>% rownames_to_column(var = "ID")
prot.id <- prot %>% rownames_to_column(var = "ID")

# merge entries with data in all four data sets and response data
all.data <- drug.resp.id %>%
  inner_join(x = ., y = mirna.id, by = "ID") %>%
  inner_join(x = ., y = rnaseq.id, by = "ID") %>%
  inner_join(x = ., y = methyl.id, by = "ID") %>%
  inner_join(x = ., y = prot.id, by = "ID") %>%
  column_to_rownames("ID")
saveRDS(all.data, "data/ccle_data_integrated_unfiltered.rds")

In line with our goal of better understanding patients’ responses to drugs given their molecular profiles, the Cancer Cell Line Encyclopedia (CCLE) project is one of the most comprehensive public databases for developing detailed genetic and pharmacologic characterizations of human cancer cell lines (Costello et al. 2014). In particular, the CCLE generated large-scale genomic, epigenomic, and proteomic data - namely, 1) miRNA expression (734 miRNAs), 2) gene expression measured via RNASeq (50114 transcripts), 3) methylation (20192 regions around transcription start sites), and 4) protein expression (214 proteins) - for \(397\) human cancer cell lines, together with pharmacological profiling of \(24\) chemotherapy and targeted therapy drugs. These cell lines come from \(23\) different tumor/tissue types. In Table 2.2, we provide a full list of tumor types and the number of cell lines from each tumor type, and in Table 2.1, we briefly describe the technologies used to generate the high-throughput data for each molecular profile of interest. These data can be downloaded from DepMap Public 18Q3 (https://depmap.org/portal/download/).

Table 2.1: Description of molecular profiling technologies used to obtain CCLE data
Profile Description
RNASeq generated with Illumina protocol (TruSeq) and sequencing machine (either HiSeq 2000 or HiSeq 2500) to profile mRNA
Methylation generated from reduced representation bisulfite sequencing (RRBS sequencing)
miRNA pre-selected miRNAs were profiled with NanoString platform
Protein generated with reverse phase protein array (RPPA)

In addition to the molecular profiling data, the CCLE project incorporated a systematic framework to measure molecular correlates of pharmacological sensitivity in vitro. Specifically, for each cell line-drug combination, two groups of cancer cells were cultured across eight different dosages — one group (the treatment group) was treated with the drug at the given dose while the other group (the control group) was treated with blank culture medium. After 72 to 84 hours, assays were used to calculate the percent growth inhibition in the drug-treated group relative to the control. A very potent drug can inhibit cancer cell growth at a very low dose, while an ineffective drug may never reach a given percentage of growth inhibition (e.g., 50%) even at a much higher dose.

The dose-response data were then fitted to one of three models - a constant model, a logistic (sigmoid) model, or a non-parametric spline interpolation (this last model accounts for less than 5% of the fits). The best model for each dose-response curve was selected using a Chi-squared test, and the area between the response curve and \(0\) (i.e., the no-response reference level) was defined to be the activity area, or AUC (see Figure 2.1). This AUC is measured on an 8-point scale, with 0 corresponding to an inactive compound and 8 corresponding to a compound with \(100\%\) inhibition at all 8 dosages. The AUC is a well-accepted measure of drug sensitivity (Jang et al. 2014; Barretina et al. 2012) and will be the primary response of interest in this work.


Figure 2.1: Illustration of the activity area (AUC) definition for the drug-response curve.
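As a toy illustration of this activity-area computation (a sketch, not the exact CCLE procedure; the dosages and curve parameters below are entirely made up):

```r
# Toy activity-area (AUC) sketch; all dosages and parameters are illustrative.
doses <- log10(c(0.0025, 0.008, 0.025, 0.08, 0.25, 0.8, 2.53, 8))  # log10 uM

# logistic (sigmoid) inhibition curve: fraction of growth inhibited at dose d
inhibition <- function(d, ec50, slope) 1 / (1 + exp(-slope * (d - ec50)))

# activity area = area between the response curve and 0 across the 8 dosages;
# it ranges from 0 (inactive compound) to 8 (100% inhibition at all 8 dosages)
auc <- sum(inhibition(doses, ec50 = -0.5, slope = 2))
```

A potent drug's inhibition curve rises to 1 at low doses, yielding an AUC near 8, while an inactive compound stays near 0 across all dosages.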

In Figure 2.2, we provide a graphical summary of the raw molecular and pharmacological profiling data sets, obtained from the CCLE.

Figure 2.2: Graphical overview of the raw CCLE molecular profiling data, which are used to predict the drug responses of 24 therapeutic drugs, as measured via the drug response AUC

Table 2.2: Frequency of cell lines from each tumor type
Tumor Type Frequency
LUNG 72
HAEMATOPOIETIC_AND_LYMPHOID_TISSUE 58
SKIN 36
BREAST 26
CENTRAL_NERVOUS_SYSTEM 23
OVARY 22
LARGE_INTESTINE 21
PANCREAS 20
ENDOMETRIUM 15
STOMACH 14
OESOPHAGUS 13
LIVER 12
URINARY_TRACT 12
AUTONOMIC_GANGLIA 9
SOFT_TISSUE 9
BONE 8
KIDNEY 7
UPPER_AERODIGESTIVE_TRACT 6
THYROID 5
PLEURA 4
PROSTATE 3
BILIARY_TRACT 1
SALIVARY_GLAND 1

2.1 Data cleaning and preprocessing

Given the raw data described above, there are a few areas of initial concern that warrant preprocessing. First, the cancer cell lines encompass 23 different tumor sites with varying frequencies (see Table 2.2), and cell lines from the same tumor site tend to have more similar expression profiles than cell lines from different sites. To illustrate, we observe clusters of cell lines by tumor site when performing both hierarchical clustering and PCA on the RNASeq profile in Figure 2.3. Here, since there are 23 different tissues, we highlight only the five most prominent tissue groups (orange = central nervous system, blue = haematopoietic and lymphoid tissue, green = large intestine, pink = pancreas, brown = skin) and color the other 18 tissues in grey to avoid clutter. The other molecular profiles (i.e., miRNA, methylation, and protein) exhibit a similar phenomenon, which is to say that the tumor type explains the most variation and is the most dominant (unsupervised) pattern in each profile.


Figure 2.3: (Left) Hierarchical clustering with Ward’s linkage and (Right) PCA to the (log-transformed) RNASeq data set, colored by tumor site
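The clustering and PCA in Figure 2.3 can be reproduced along the following lines (a sketch: the objects `rnaseq` and `tissue.colors` are assumed to hold the RNASeq matrix of cell lines by genes and a tissue-based color vector, respectively):

```r
# Hierarchical clustering (Ward's linkage) and PCA on log-transformed RNASeq;
# `rnaseq` (cell lines x genes) and `tissue.colors` are assumed objects.
log.rnaseq <- log(rnaseq + 1)

# Ward's linkage on Euclidean distances between cell lines
hc <- hclust(dist(log.rnaseq), method = "ward.D2")
plot(hc, labels = FALSE)

# PCA; plot the first two principal components, colored by tumor site
pc <- prcomp(log.rnaseq, center = TRUE, scale. = FALSE)
plot(pc$x[, 1], pc$x[, 2], col = tissue.colors, xlab = "PC1", ylab = "PC2")
```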

Due to these inherent differences and variability between tissues, we choose to omit the cell lines from tissues with fewer than 8 cell lines (i.e. the kidney, upper aerodigestive tract, thyroid, pleura, prostate, biliary tract, and salivary gland). This reduces our sample size to \(370\) cell lines from \(16\) tumor sites.

# tissues with < 8 cell lines
exclude.tissues <- table(all.tissue) %>% 
  as.data.frame() %>%
  filter(Freq < 8) %>%
  pull(all.tissue) %>%
  droplevels()

# ignore cell lines from tissues with < 8 cell lines
all.data <- all.data[!(all.tissue %in% exclude.tissues), ]
all.tissue <- all.tissue[!(all.tissue %in% exclude.tissues)] %>% droplevels()

From these remaining samples, we partition the data into a training, validation, and test set using a 50-25-25% partitioning scheme, stratified on tumor type. Note that the minimum threshold of \(8\) cell lines in each tumor site was chosen to ensure we have at least 2 cell lines from each tumor site in each of the training, validation, and test splits.

# splitting proportions; can change as needed
test.prop <- 1/4
valid.prop <- 1/4
train.prop <- 1 - test.prop - valid.prop

# stratified split into training, validation, and test based upon tissue type
set.seed(100)
data.split <- data.frame(tissue = all.tissue) %>%
  group_by(tissue) %>%
  mutate(split.idx = sample(c(rep("train", ceiling(train.prop * n())),
                              rep("valid", ceiling(valid.prop * n())),
                              rep("test", ceiling(test.prop * n()))),
                            size = n(), replace = F)) %>%
  ungroup()

# list of three: train, valid, test
data.split <- split(all.data, data.split$split.idx) 

# split merged data back into response, mirna, rnaseq, methyl, and prot
data.split.ls <- lapply(
  data.split, 
  FUN = function(X) {
    df.idx <- c(rep("drug_resp", times = ncol(drug.resp)),
                rep("mirna", times = ncol(mirna)),
                rep("rnaseq", times = ncol(rnaseq)),
                rep("methyl", times = ncol(methyl)),
                rep("prot", times = ncol(prot)))
    
    X <- X %>%
      t() %>% as.data.frame() %>% # transpose data.frame
      split(., df.idx) %>% # split df into response, mirna, rnaseq, methyl, prot
      lapply(FUN = function(x) {return(as.data.frame(t(x)))}) # transpose back
    
    return(X)
  })

# save data split
saveRDS(data.split.ls$train, "./data/train.rds")
saveRDS(data.split.ls$valid, "./data/valid.rds")
saveRDS(data.split.ls$test, "./data/test.rds")

In addition to reducing the number of samples, we also reduced the number of features to manageable sizes before continuing with our analyses. Originally, the CCLE data consisted of 734 miRNAs, 50114 genes, 20192 transcription start sites (TSS), and 214 antibodies. Given the incredibly large number of features in the RNASeq and methylation data sets, we aggressively filtered these features, keeping the top \(10\%\) of genes (roughly 5000 genes) and top \(20\%\) of TSS (roughly 4000 TSS) with the highest variance. We also transformed the miRNA and RNASeq expression values using the log-transformation \(\log(x+1)\) in order to mitigate problems with highly skewed positive count values.
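The filtering below relies on the project helper filterByVar, which keeps the max.p highest-variance columns; a minimal sketch of such a helper (our assumed implementation, not necessarily the original) is:

```r
# Keep the `max.p` columns of `dat` with the highest variance (assumed helper).
filterByVar <- function(dat, max.p) {
  col.vars <- apply(dat, 2, var, na.rm = TRUE)
  keep <- order(col.vars, decreasing = TRUE)[seq_len(min(max.p, ncol(dat)))]
  return(dat[, keep, drop = FALSE])
}
```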

# read in pre-processed training, validation, and test sets -------------------
train <- readRDS("./data/train.rds")
valid <- readRDS("./data/valid.rds")
test <- readRDS("./data/test.rds")

# training data ---------------------------------------------------------------
drug.resp <- train$drug_resp
mirna <- train$mirna
rnaseq <- train$rnaseq
methyl <- train$methyl
prot <- train$prot

# helper variables
nvalid <- nrow(valid$drug_resp)
ntest <- nrow(test$drug_resp)
ndrugs <- ncol(drug.resp)
types <- c("mirna", "rnaseq", "methyl", "prot") # data types

# remove columns in rnaseq with no variance
rnaseq <- rnaseq %>%
  select(-which(apply(., 2, var) == 0))

# impute NAs in methylation with the column mean
methyl.imputed <- impute_mean(methyl) %>%
  # remove columns in methyl with no variance
  select(-which(apply(., 2, var, na.rm = T) == 0))

# get tissue type for each cell line
cell.lines <- rownames(drug.resp)
tissue <- str_split(cell.lines, pattern = "_", n = 2, simplify = T)[, 2] %>%
  as.factor()

# validation data -------------------------------------------------------------
mirna.valid <- valid$mirna
rnaseq.valid <- valid$rnaseq[, colnames(rnaseq)]
methyl.valid <- valid$methyl[, colnames(methyl.imputed)] %>%
  replace_na(data = ., replace = as.list(colMeans(methyl.imputed)))
prot.valid <- valid$prot
resp.valid <- valid$drug_resp

# get tissue type for each cell line in validation set
cell.lines.valid <- rownames(resp.valid)
tissue.valid <- str_split(cell.lines.valid,
  pattern = "_",
  n = 2, simplify = T
)[, 2] %>%
  as.factor()

# test data -------------------------------------------------------------
mirna.test <- test$mirna
rnaseq.test <- test$rnaseq[, colnames(rnaseq)]
methyl.test <- test$methyl[, colnames(methyl.imputed)] %>%
  replace_na(data = ., replace = as.list(colMeans(methyl.imputed)))
prot.test <- test$prot
resp.test <- test$drug_resp

# filter features -------------------------------------------------------------
rnaseq.filt <- filterByVar(dat = rnaseq, max.p = 5000)
methyl.filt <- filterByVar(dat = methyl.imputed, max.p = 4000)
rnaseq.valid.filt <- rnaseq.valid[, colnames(rnaseq.filt)]
methyl.valid.filt <- methyl.valid[, colnames(methyl.filt)]
rnaseq.test.filt <- rnaseq.test[, colnames(rnaseq.filt)]
methyl.test.filt <- methyl.test[, colnames(methyl.filt)]

# put cleaned/filtered training data in list ----------------------------------
all.data <- list(
  mirna = log(mirna + 1),
  rnaseq = log(rnaseq.filt + 1),
  methyl = methyl.filt,
  prot = prot
)

# put cleaned/filtered validation data in list --------------------------------
all.data.valid <- list(
  mirna = log(mirna.valid + 1),
  rnaseq = log(rnaseq.valid.filt + 1),
  methyl = methyl.valid.filt,
  prot = prot.valid
)

# put cleaned/filtered test data in list --------------------------------------
all.data.test <- list(
  mirna = log(mirna.test + 1),
  rnaseq = log(rnaseq.test.filt + 1),
  methyl = methyl.test.filt,
  prot = prot.test
)

# put cleaned/filtered data in concatenated data matrix -----------------------
concat.data <- do.call(cbind, all.data)
concat.data.valid <- do.call(cbind, all.data.valid)
concat.data.test <- do.call(cbind, all.data.test)

We recognize however that there were many other reasonable ways to preprocess this data. For instance, we could have taken the top 20% of genes and top 40% of TSS with the highest variance. Another common alternative would have been to filter features using marginal correlations with the response or using a multivariate prediction model (e.g., the Lasso). To assess robustness to these choices, we reran our analysis using the following alternative preprocessing procedures in ccle_preprocessing_alternatives.Rmd:

  • Variance-filtering, taking the top \(20\%\) of genes and \(40\%\) of TSS (i.e., building a model twice as large as the current work)
  • Filtering by marginal correlations between the features and drug responses, taking the top \(10\%\) of genes and \(20\%\) of TSS
  • Using the Lasso for filtering and keeping the top \(10\%\) of genes and \(20\%\) of TSS with the largest coefficients (in magnitude)

Across these different preprocessing procedures, the prediction accuracies are relatively stable with respect to both the number of features taken and the filtering method. Due to the high computational cost of fitting numerous models (and data-perturbed models) for 24 drugs, we thus focus on the variance-filtering procedure, taking the top \(10\%\) of genes and \(20\%\) of TSS. This procedure is preferred over the alternatives: it is less computationally expensive than the model with twice as many features, and unlike filtering via marginal correlations or multivariate prediction models such as the Lasso, variance-filtering never looks at the response and thus avoids any risk of data snooping and over-fitting.

To summarize, after preprocessing and data cleaning, we arrive at \(370\) cell lines with molecular data across the four molecular profiles of interest:

  • miRNA (log-transformed): p = 734 miRNAs,
  • RNASeq (log-transformed): p = 5000 genes,
  • Methylation: p = 4000 transcription start sites (TSS),
  • Protein: p = 214 antibodies,

and pharmacological data (in particular the AUC drug response scores) for 24 compounds. We provide a quick summary of the resulting pre-processed data by plotting the overall distribution of features in the four data sets as well as the distribution of the 24 drug responses in Figure 2.4.

Figure 2.4: In the left column, we plot the distribution of the features in each of the four data sets. On the right, we show the distribution of the responses for each of the 24 drugs.

3 Prediction formulation

Before turning to our primary goal of identifying important genes and proteins for drug response prediction, we first use predictive accuracy as a reality check to filter out models that are poor fits for the observed data. As in data preprocessing, there are many human judgment calls here in the predictive modeling stage that warrant justification including the profiling data types used for training and the decision of which methods to fit.

3.1 Choosing the molecular profiles

In terms of the molecular profiles used in the fitting, since gene expression is known to be the most predictive profile of drug sensitivity (Costello et al. 2014), we choose to first fit predictive methods on each molecular profiling type (i.e., miRNA, RNASeq, methylation, and protein expression profiles) separately and take this non-integrative approach as our baseline. Similar to Costello et al. (2014), we will show that the CCLE data follows a similar pattern — the RNASeq data is the most predictive of drug sensitivity. However, in some cases, predictive performance can increase when training on multiple profiles simultaneously (Costello et al. 2014). Since different types of molecules all interact within the same biological system, an integrative approach that jointly incorporates the miRNA, RNASeq, methylation, and protein data sets may be able to provide a more holistic and perhaps more realistic model of the biological phenomenon at hand. We thus also try various integrative approaches including training on the concatenated molecular profiling data and other existing integrative -omics methods.

3.2 Choosing the prediction models

In the ideal setting, the prediction methods chosen should have some justified connection to the biological problem at hand, but a priori, it is unclear which models or assumptions best fit the biological drug response mechanism. Nonetheless, we have reasons to believe that the Lasso, elastic net, random forest, and kernel ridge regression are particularly appealing fits for this problem. We discuss each of these models, their assumptions, and justifications next.

First, the Lasso assumes a sparse linear model, meaning that the effect of each feature is additive and only a sparse number of the features contribute to the drug sensitivity. The simplicity and interpretability of the Lasso makes it a popular tool for bioinformatics prediction tasks, so we choose to use the Lasso as a baseline model for our analysis. The elastic net is perhaps even more popular than the Lasso in drug response prediction studies (Jang et al. 2014; Barretina et al. 2012). Similar to the Lasso, the elastic net assumes linearity and some sparsity but is also able to better handle correlated features. Beyond linearity, kernel ridge regression with a Gaussian kernel allows for more flexible, but less interpretable, functional relationships that are not necessarily linear. Kernel methods have been applied in previous case studies with great success (Costello et al. 2014) and are hence promising candidates for our study as well. Lastly, a random forest can be viewed as a collection of localized, non-linear thresholded decision rules (like on-off switches), which are well-suited for many biological processes that match the combinatorial thresholding (or switch-like) behavior of decision trees (Nelson, Lehninger, and Cox 2008). Random forests are also invariant to the scale of the data. This is especially advantageous for integrating different data sets with varying scales and domain types (e.g., count-valued RNASeq expression, proportion-valued methylation data, continuous-valued protein expression).

Note that though an alternative approach would have been to develop new methodology, we instead leverage these existing machine learning methods that have been rigorously vetted and have been shown to work well in many related problems.

3.3 Prediction evaluation metrics

To measure the accuracy of the models, we consider many different metrics as each captures a different aspect of prediction and has advantages and disadvantages:

  1. Mean-squared error (\(MSE\)) = the mean of the squared errors between the predicted responses and the observed responses. This is the most widely used prediction metric in machine learning but can be volatile due to outliers.
  2. \(R^2\) = \(1 - \frac{\text{MSE}(Y, \hat{Y})}{\text{Var}(Y)}\), where \(\text{Var}(Y)\) denotes the variance of the observed responses, and \(\text{MSE}(Y, \hat{Y})\) denotes the mean of the squared errors between the predicted responses \(\hat{Y}\) and observed responses \(Y\). \(R^2\) is a rescaling of the MSE that accounts for the amount of variation in the observed response and thus allows us to more easily compare accuracies between drug response models with different amounts of variation in the observed response; as with the MSE, however, \(R^2\) can be heavily influenced by outliers.
  3. Correlation = the (Pearson) correlation between the predicted responses and the observed responses. The correlation metric is scale-invariant and thus conveniently enables direct comparisons between models for different drugs.
  4. Probabilistic concordance-index (PC-index) = a measure of how well the predicted rankings agree with the true responses. This metric, like the correlation metric, takes into account the variance of the drug responses, \(y\), but it also assumes that the drug responses follow a Gaussian distribution, which may not be true in some cases. We consider this metric because it is the primary method of evaluation in Costello et al. (2014). Given the large scale and breadth of the NCI-DREAM challenge in Costello et al. (2014), we will compare our results to this work. We refer to Costello et al. (2014) for details on the exact computation of the PC-index.

In each of the proposed evaluation metrics above, we receive a score for each of the 24 drug response models. It may also be beneficial to aggregate the 24 scores into a single number for concrete evaluation. In particular, Costello et al. (2014) used a weighted average of the PC-indices to compare various models and referred to this evaluation metric as the weighted PC-index (WPC-index). To compare our results with those in Costello et al. (2014), we also consider the WPC-index in evaluating our models.
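For concreteness, the first three metrics can be computed for a single drug along the following lines (a sketch; eval_metrics is our illustrative helper, and note that var() uses the \(n - 1\) denominator while the MSE averages over \(n\)):

```r
# Compute MSE, R^2, and Pearson correlation for one drug's predictions;
# `y` = observed AUCs, `y.hat` = predicted AUCs (illustrative names).
eval_metrics <- function(y, y.hat) {
  mse <- mean((y - y.hat)^2, na.rm = TRUE)
  r2 <- 1 - mse / var(y, na.rm = TRUE)
  corr <- cor(y, y.hat, use = "pairwise.complete.obs")
  return(c(MSE = mse, R2 = r2, Correlation = corr))
}
```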

4 Stability formulation

Beyond predictability checks, we emphasize the importance of stability throughout the data science life cycle so as to reduce reliance on particular human judgment calls. In particular, our primary goal is to identify the genes and proteins which are predictive of drug sensitivity, and stability serves as a minimum requirement in this scientific knowledge discovery. Ultimately, successfully identifying these -omic features that are stable and predictive of drug sensitivity is a first step towards a more reliable scientific recommendation system for preclinical research. This has important practical implications, especially given the reliance and success of hypothesis-driven studies in previous cancer research.

With this goal of reliable interpretability in mind, we leverage the stability principle and quantify the stability of important features under numerous data and model perturbations. Here, we say an -omic feature is stable if it is stable across both data-perturbed bootstrap samples and across multiple models with similarly high predictive power. That is, if a particular feature is frequently selected or deemed important even when slightly perturbing the data via bootstrap samples and when using different models, then we say that the feature is stable. This stability across data perturbations and models gives us evidence to believe that the feature is associated with the desired response (in this case, the drug sensitivity measured via AUC).
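As a sketch of this idea for a single model and drug (stability_scores is our illustrative helper; the full analysis additionally aggregates importances across the multiple fitted models):

```r
library(glmnet)

# Fraction of bootstrap samples in which each feature is selected by the Lasso;
# features with scores near 1 are stable under this data perturbation.
stability_scores <- function(X, y, B = 100) {
  counts <- rep(0, ncol(X))
  for (b in seq_len(B)) {
    idx <- sample(nrow(X), replace = TRUE)           # bootstrap sample
    fit <- cv.glmnet(X[idx, ], y[idx], alpha = 1)    # lasso + cv
    beta <- as.vector(coef(fit, s = "lambda.min"))[-1]  # drop intercept
    counts <- counts + (beta != 0)
  }
  return(counts / B)
}
```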

5 Model Fitting & Prediction Analysis

5.1 Non-integrative fits

For each of the 24 drugs, we begin by fitting the Lasso, elastic net (with \(\alpha = 0.5\)), kernel ridge regression (with a Gaussian kernel), and random forests (RF) to each molecular profiling data set separately as a baseline. To select hyperparameters in each method, we use 5-fold cross-validation, where the folds are stratified by tissue type. We also investigate the estimation stability with cross-validation (ESCV) metric for selecting the Lasso’s hyperparameter, as this metric incorporates a stability measure within the cross-validation framework to yield more stable estimation properties with minimal loss of accuracy (Lim and Yu 2016).

# setting hyperparameters
nfolds <- 5
ntree <- 500
mtrys <- list(
  mirna = seq(150, 350, by = 50),
  rnaseq = seq(1000, 2000, by = 250),
  methyl = seq(1000, 2000, by = 250),
  prot = seq(50, 90, by = 10)
)
# fit non-integrative models for each of the 24 drugs
registerDoParallel(cores = 24)
for (type in types) {
  
  out <- foreach(drug = 1:ndrugs, 
                 .packages = c("glmnet", "KRLS", "caret")) %dopar% {
    
    Xtrain <- as.matrix(all.data[[type]])
    Xvalid <- as.matrix(all.data.valid[[type]])
    
    # remove data with missing responses
    missing.idx <- which(is.na(drug.resp[, drug]))
    if (length(missing.idx) >= 1) {
      Xtrain <- Xtrain[-missing.idx, ]
      y <- drug.resp[-missing.idx, drug]
      tissue.type <- tissue[-missing.idx]
    } else {
      y <- drug.resp[, drug]
      tissue.type <- tissue
    }
    
    # stratified cv folds
    foldid <- data.frame(tissue = tissue.type) %>%
      group_by(tissue) %>%
      mutate(split.idx = sample(rep(1:nfolds, each = ceiling(n() / nfolds)),
                                size = n(), replace = F)) %>%
      pull(split.idx)
    
    # fit lasso+cv
    lasso.fit <- cv.glmnet(x = Xtrain, y = y, 
                           nfolds = nfolds, alpha = 1, foldid = foldid)
    
    # fit lasso+escv
    lambda.escv <- escv.glmnet(x = Xtrain, y = y, nfolds = nfolds, 
                               lambda = seq(0, 0.2, 0.002),
                               alpha = 1, foldid = foldid)$lambda.escv
    lasso.escv.fit <- glmnet(x = Xtrain, y = y, alpha = 1, lambda = lambda.escv)
    
    # fit elastic net+cv
    elnet.fit <- cv.glmnet(x = Xtrain, y = y, 
                           nfolds = nfolds, alpha = .5, foldid = foldid)
    
    # fit kernel
    kernel.fit <- krls(X = Xtrain, y = y, 
                       whichkernel = "gaussian", print.level = 0)
  
    # fit rf
    rf.fit <- caret::train(
      x = Xtrain, y = y, preProcess = NULL, method = "ranger",
      tuneGrid = expand.grid(mtry = mtrys[[type]],
                             splitrule = "variance",
                             min.node.size = c(3, 5)),
      trControl = trainControl(method = "cv",
                               number = nfolds,
                               verboseIter = F,
                               selectionFunction = "best", 
                               index = groupKFold(tissue.type, k = nfolds),
                               allowParallel = F),
      importance = "impurity", num.trees = ntree, num.threads = 1
    )$finalModel
  
    # make predictions
    lasso.pred <- predict(lasso.fit, newx = Xvalid, s = "lambda.min")
    lasso.escv.pred <- predict(lasso.escv.fit, newx = Xvalid)
    elnet.pred <- predict(elnet.fit, newx = Xvalid, s = "lambda.min")
    kernel.pred <- predict(kernel.fit, newdata = Xvalid)$fit
    rf.pred <- predict(rf.fit, Xvalid, num.threads = 1)$predictions
    
    list(lasso.fits = lasso.fit,
         lasso.escv.fits = lasso.escv.fit,
         elnet.fits = elnet.fit,
         kernel.fits = kernel.fit,
         rf.fits = rf.fit,
         lasso.preds = lasso.pred,
         lasso.escv.preds = lasso.escv.pred,
         elnet.preds = elnet.pred,
         kernel.preds = kernel.pred,
         rf.preds = rf.pred)
  }
  
  lasso.fits <- lapply(out, FUN = function(X) {return(X$lasso.fits)})
  lasso.escv.fits <- lapply(out, FUN = function(X) {return(X$lasso.escv.fits)})
  elnet.fits <- lapply(out, FUN = function(X) {return(X$elnet.fits)})
  kernel.fits <- lapply(out, FUN = function(X) {return(X$kernel.fits)})
  rf.fits <- lapply(out, FUN = function(X) {return(X$rf.fits)})
  lasso.preds <- mapply(out, FUN = function(X) {return(X$lasso.preds)})
  lasso.escv.preds <- mapply(out, 
                             FUN = function(X) {return(X$lasso.escv.preds)})
  elnet.preds <- mapply(out, FUN = function(X) {return(X$elnet.preds)})
  kernel.preds <- mapply(out, FUN = function(X) {return(X$kernel.preds)})
  rf.preds <- mapply(out, FUN = function(X) {return(X$rf.preds)})
  
  indiv.preds.ls <- list(Lasso = lasso.preds,
                         `Lasso (ESCV)` = lasso.escv.preds,
                         `Elastic Net` = elnet.preds,
                         Kernel = kernel.preds,
                         RF = rf.preds)
  
  # compute mse
  indiv.mse <- mapply(indiv.preds.ls, FUN = function(preds) {
    return(colMeans((preds - resp.valid)^2, na.rm = T))
  }) %>% t()
  
  # compute correlation
  indiv.corr <- mapply(indiv.preds.ls, FUN = function(preds) {
    return(diag(cor(x = preds, y = resp.valid, use = "pairwise.complete.obs")))
  }) %>% t()
  
  # compute pc metric
  pc.null <- pc.nulldist(resp.valid)
  indiv.pc <- mapply(indiv.preds.ls, FUN = function(preds) {
    return(pcs(preds, resp.valid, pc.null))
  }) %>% t()
  
  # compute wpc metric
  indiv.wpc <- mapply(indiv.preds.ls, FUN = function(preds) {
    return(wpc(preds, resp.valid, pc.null))
  })
  
  saveRDS(indiv.mse, paste0(save.path, type, "_mse.rds"))
  saveRDS(indiv.corr, paste0(save.path, type, "_corr.rds"))
  saveRDS(indiv.pc, paste0(save.path, type, "_pc.rds"))
  saveRDS(indiv.wpc, paste0(save.path, type, "_wpc.rds"))
  saveRDS(lasso.fits, paste0(save.path, type, "_lasso_fits.rds"))
  saveRDS(lasso.escv.fits, paste0(save.path, type, "_lasso_escv_fits.rds"))
  saveRDS(elnet.fits, paste0(save.path, type, "_elnet_fits.rds"))
  saveRDS(kernel.fits, paste0(save.path, type, "_kernel_fits.rds"))
  saveRDS(rf.fits, paste0(save.path, type, "_rf_fits.rds"))
}

In Table 5.1, we summarize the validation prediction accuracy using the various machine learning methods, trained on each profile separately. We see that for each method, the RNASeq data set contains the most predictive power, as expected, and across methods, the kernel ridge regression model outperforms the other candidate models. Note that these results have been aggregated across the 24 drugs.

Table 5.1: For various non-integrative methods applied to each molecular profile separately, we evaluate the validation WPC-index, average R^2, and average correlation across the 24 drug response models and highlight the best model according to each metric in bold.

R-squared
              Methylation   miRNA   Protein  RNASeq
Kernel Ridge        0.111   0.104     0.168   0.231
Elastic Net         0.102   0.124     0.126   0.183
Lasso               0.117   0.105     0.121   0.172
Lasso (ESCV)        0.114   0.113     0.129   0.195
RF                  0.124   0.0882    0.123   0.214

Correlation
              Methylation   miRNA   Protein  RNASeq
Kernel Ridge        0.357   0.319     0.413   0.488
Elastic Net         0.346   0.340     0.362   0.439
Lasso               0.358   0.306     0.353   0.433
Lasso (ESCV)        0.362   0.335     0.354   0.450
RF                  0.387   0.321     0.352   0.483

WPC-index
              Methylation   miRNA   Protein  RNASeq
Kernel Ridge        0.600   0.603     0.617   0.631
Elastic Net         0.602   0.606     0.608   0.626
Lasso               0.597   0.605     0.609   0.620
Lasso (ESCV)        0.600   0.601     0.609   0.623
RF                  0.599   0.594     0.606   0.626
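The helpers `pcs`, `wpc`, and `pc.nulldist` used in the evaluation code above are defined elsewhere in our repository. As a rough illustration only (not their exact implementation), a plain, unweighted concordance index over cell-line pairs can be sketched as follows; the actual PC and WPC metrics additionally incorporate a null distribution and per-drug weights (Costello et al. 2014), which are omitted here.

```r
# Illustrative sketch only: a plain (unweighted) concordance index -- the
# fraction of cell-line pairs whose predicted ordering agrees with the
# observed ordering. Not the repository's exact pcs()/wpc() implementation.
concordance <- function(pred, obs) {
  ok <- !is.na(pred) & !is.na(obs)
  pred <- pred[ok]
  obs <- obs[ok]
  pairs <- combn(length(pred), 2)          # all cell-line pairs
  d.pred <- pred[pairs[1, ]] - pred[pairs[2, ]]
  d.obs <- obs[pairs[1, ]] - obs[pairs[2, ]]
  mean(sign(d.pred) == sign(d.obs))        # agreement rate
}
```

A perfectly ordered prediction attains a concordance of 1, a perfectly reversed one attains 0, and random predictions hover near 0.5.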


At the individual drug level, the figures below provide a more detailed picture of the validation accuracies for each individual drug. Here, we see that for most drugs, the RNASeq profile indeed provides the greatest predictive power, but for other drugs (e.g., Erlotinib, Lapatinib), the protein expression data yields the best accuracy.

Validation MSE

Kernel Ridge

(#fig:sub_chunk_100) Validation MSE from the Kernel Ridge fit for each of the 24 drugs when trained on each molecular profile separately

Elastic Net

(#fig:sub_chunk_101) Validation MSE from the Elastic Net fit for each of the 24 drugs when trained on each molecular profile separately

Lasso

(#fig:sub_chunk_102) Validation MSE from the Lasso fit for each of the 24 drugs when trained on each molecular profile separately

Lasso (ESCV)

(#fig:sub_chunk_103) Validation MSE from the Lasso (ESCV) fit for each of the 24 drugs when trained on each molecular profile separately

RF

(#fig:sub_chunk_104) Validation MSE from the RF fit for each of the 24 drugs when trained on each molecular profile separately

Validation R2

Kernel Ridge

(#fig:sub_chunk_105) Validation R2 from the Kernel Ridge fit for each of the 24 drugs when trained on each molecular profile separately

Elastic Net

(#fig:sub_chunk_106) Validation R2 from the Elastic Net fit for each of the 24 drugs when trained on each molecular profile separately

Lasso

(#fig:sub_chunk_107) Validation R2 from the Lasso fit for each of the 24 drugs when trained on each molecular profile separately

Lasso (ESCV)

(#fig:sub_chunk_108) Validation R2 from the Lasso (ESCV) fit for each of the 24 drugs when trained on each molecular profile separately

RF

(#fig:sub_chunk_109) Validation R2 from the RF fit for each of the 24 drugs when trained on each molecular profile separately

Validation Correlation

Kernel Ridge

(#fig:sub_chunk_110) Validation Correlation from the Kernel Ridge fit for each of the 24 drugs when trained on each molecular profile separately

Elastic Net

(#fig:sub_chunk_111) Validation Correlation from the Elastic Net fit for each of the 24 drugs when trained on each molecular profile separately

Lasso

(#fig:sub_chunk_112) Validation Correlation from the Lasso fit for each of the 24 drugs when trained on each molecular profile separately

Lasso (ESCV)

(#fig:sub_chunk_113) Validation Correlation from the Lasso (ESCV) fit for each of the 24 drugs when trained on each molecular profile separately

RF

(#fig:sub_chunk_114) Validation Correlation from the RF fit for each of the 24 drugs when trained on each molecular profile separately

Validation PC-index

Kernel Ridge

(#fig:sub_chunk_115) Validation PC-index from the Kernel Ridge fit for each of the 24 drugs when trained on each molecular profile separately

Elastic Net

(#fig:sub_chunk_116) Validation PC-index from the Elastic Net fit for each of the 24 drugs when trained on each molecular profile separately

Lasso

(#fig:sub_chunk_117) Validation PC-index from the Lasso fit for each of the 24 drugs when trained on each molecular profile separately

Lasso (ESCV)

(#fig:sub_chunk_118) Validation PC-index from the Lasso (ESCV) fit for each of the 24 drugs when trained on each molecular profile separately

RF

(#fig:sub_chunk_119) Validation PC-index from the RF fit for each of the 24 drugs when trained on each molecular profile separately

When comparing the validation errors across models, Table 5.2 indicates that the best model depends on the drug. According to the \(R^2\) value, kernel ridge regression trained on the RNASeq data set was the best non-integrative model for 12 out of the 24 drugs. On the other hand, the elastic net, Lasso (tuned by CV), Lasso (tuned by ESCV), and RF trained on the RNASeq data were the best non-integrative models for 1, 1, 4, and 6 out of the 24 drugs, respectively.

Validation R2

Table 5.2: For each molecular profile used for fitting, we count the number of drugs (out of 24) for which each method performed best according to the validation R2 metric, compared to the competing methods.
Methylation miRNA Protein RNASeq
Kernel Ridge 7 5 16 12
Elastic Net 1 8 1 1
Lasso 3 2 2 1
Lasso (ESCV) 4 4 3 4
RF 9 5 2 6

Validation Correlation

Table 5.2: For each molecular profile used for fitting, we count the number of drugs (out of 24) for which each method performed best according to the validation Correlation metric, compared to the competing methods.
Methylation miRNA Protein RNASeq
Kernel Ridge 6 6 16 11
Elastic Net 2 10 2 4
Lasso 2 1 3 2
Lasso (ESCV) 4 3 1 1
RF 10 5 2 6

Validation PC-index

Table 5.2: For each molecular profile used for fitting, we count the number of drugs (out of 24) for which each method performed best according to the validation PC-index metric, compared to the competing methods.
Methylation miRNA Protein RNASeq
Kernel Ridge 6 6 15 12
Elastic Net 3 5 5 3
Lasso 3 3 2 3
Lasso (ESCV) 3 4 1 2
RF 9 6 1 4
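The per-drug win counts in the tables above can be reproduced from a methods-by-drugs metric matrix per profile. A minimal sketch, assuming hypothetical matrix shapes where higher is better and ignoring the possibility of ties:

```r
# Minimal sketch (hypothetical shapes): given one methods-x-drugs metric
# matrix per profile, where higher is better, count how many drugs each
# method wins within each profile.
count_wins <- function(metric.by.profile) {
  sapply(metric.by.profile, function(m) {
    winners <- rownames(m)[apply(m, 2, which.max)]   # best method per drug
    table(factor(winners, levels = rownames(m)))     # wins per method
  })
}
```

Applied to, say, the validation R^2 matrices with 24 drug columns, each column of the result sums to 24 (ties are broken by `which.max` in row order).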

We can further see in Figure 5.1 that for a given drug, the predictions from kernel ridge regression trained on each molecular profile separately are positively correlated but still have considerable variability. This variation along with the reasonable predictive accuracy across data sets provide some indication that though the RNASeq profile is the single most predictive profile of drug sensitivity, the non-RNASeq profiles may have something new to offer and are not completely redundant.

Figure 5.1: After fitting kernel ridge regression on each molecular profile separately, we plot the predicted Erlotinib AUC responses on the validation data. Though the predictions are positively correlated, there is still considerable variability across the non-integrative model predictions.

5.2 Integrative fits

Motivated by this potential improvement from data integration, we next compare the non-integrative fits to various data integration methods. The simplest and most natural integration approach is to fit concatenated versions of the aforementioned methods (i.e., applying the Lasso, kernel ridge, or RF to the data set formed by concatenating all four molecular profiles). For methods that are not scale-invariant, such as the elastic net, Lasso, and kernel ridge, we normalized each column of the concatenated data to have mean 0 and variance 1 before fitting. We also fit more sophisticated methods that have recently been proposed for integrating -omics data: the X-shaped variational autoencoder (X-VAE), proposed for cancer data integration (Simidjievski et al. 2019), and a Bayesian multitask multiple kernel learning method (BMTMKL), which won the prominent DREAM 7 challenge out of 44 total team submissions (Costello et al. 2014).

# fit concatenated models
registerDoParallel(cores = 24)
out <- foreach(drug = 1:ndrugs, 
               .packages = c("glmnet", "KRLS", "caret")) %dopar% {
  
  # remove data with missing responses
  missing.idx <- which(is.na(drug.resp[, drug]))
  if (length(missing.idx) >= 1) {
    Xtrain.ls <- lapply(all.data, FUN = function(X) {return(X[-missing.idx, ])})
    Xvalid.ls <- all.data.valid
    y <- drug.resp[-missing.idx, drug]
    tissue.type <- tissue[-missing.idx]
  } else {
    Xtrain.ls <- all.data
    Xvalid.ls <- all.data.valid
    y <- drug.resp[, drug]
    tissue.type <- tissue
  }
  
  # concatenated data
  Xtrain <- as.matrix(do.call(args = Xtrain.ls, what = cbind))
  Xvalid <- as.matrix(do.call(args = Xvalid.ls, what = cbind))
  
  # stratified cv folds
  foldid <- data.frame(tissue = tissue.type) %>%
    group_by(tissue) %>%
    mutate(split.idx = sample(rep(1:nfolds, each = ceiling(n() / nfolds)),
                              size = n(), replace = F)) %>%
    pull(split.idx)
  
  # fit lasso
  lasso.fit <- cv.glmnet(x = Xtrain, y = y, 
                         nfolds = nfolds, alpha = 1, foldid = foldid)
  
  # fit lasso+escv
  lambda.escv <- escv.glmnet(x = Xtrain, y = y, nfolds = nfolds, 
                             lambda = seq(0, 0.2, 0.002),
                             alpha = 1, foldid = foldid)$lambda.escv
  lasso.escv.fit <- glmnet(x = Xtrain, y = y, alpha = 1, lambda = lambda.escv)
  
  # fit elastic net
  elnet.fit <- cv.glmnet(x = Xtrain, y = y, 
                         nfolds = nfolds, alpha = 0.5, foldid = foldid)
  
  # fit kernel
  kernel.fit <- krls(X = Xtrain, y = y, 
                     whichkernel = "gaussian", print.level = 0)

  # fit rf
  rf.fit <- caret::train(
    x = Xtrain, y = y, preProcess = NULL, method = "ranger",
    tuneGrid = expand.grid(mtry = seq(2000, 3000, by = 250),
                           splitrule = "variance",
                           min.node.size = c(3, 5)),
    trControl = trainControl(method = "cv",
                             number = nfolds,
                             verboseIter = F,
                             selectionFunction = "best",
                             index = groupKFold(tissue.type, k = nfolds),
                             allowParallel = F),
    importance = "impurity", num.trees = ntree, num.threads = 1
  )$finalModel
  
  # make predictions
  lasso.pred <- predict(lasso.fit, newx = Xvalid, s = "lambda.min")
  lasso.escv.pred <- predict(lasso.escv.fit, newx = Xvalid)
  elnet.pred <- predict(elnet.fit, newx = Xvalid, s = "lambda.min")
  kernel.pred <- predict(kernel.fit, newdata = Xvalid)$fit
  rf.pred <- predict(rf.fit, Xvalid, num.threads = 1)$predictions
  
  list(lasso.fits = lasso.fit,
       lasso.escv.fits = lasso.escv.fit,
       elnet.fits = elnet.fit,
       kernel.fits = kernel.fit,
       rf.fits = rf.fit,
       lasso.preds = lasso.pred,
       lasso.escv.preds = lasso.escv.pred,
       elnet.preds = elnet.pred,
       kernel.preds = kernel.pred,
       rf.preds = rf.pred)
}

lasso.fits <- lapply(out, FUN = function(X) {return(X$lasso.fits)})
lasso.escv.fits <- lapply(out, FUN = function(X) {return(X$lasso.escv.fits)})
elnet.fits <- lapply(out, FUN = function(X) {return(X$elnet.fits)})
kernel.fits <- lapply(out, FUN = function(X) {return(X$kernel.fits)})
rf.fits <- lapply(out, FUN = function(X) {return(X$rf.fits)})
lasso.preds <- mapply(out, FUN = function(X) {return(X$lasso.preds)})
lasso.escv.preds <- mapply(out, FUN = function(X) {return(X$lasso.escv.preds)})
elnet.preds <- mapply(out, FUN = function(X) {return(X$elnet.preds)})
kernel.preds <- mapply(out, FUN = function(X) {return(X$kernel.preds)})
rf.preds <- mapply(out, FUN = function(X) {return(X$rf.preds)})

concat.preds.ls <- list(Lasso = lasso.preds,
                        `Lasso (ESCV)` = lasso.escv.preds,
                        `Elastic Net` = elnet.preds,
                        Kernel = kernel.preds,
                        RF = rf.preds)

# compute mse
concat.mse <- mapply(concat.preds.ls, FUN = function(preds) {
  return(colMeans((preds - resp.valid)^2, na.rm = T))
}) %>% t()

# compute correlation
concat.corr <- mapply(concat.preds.ls, FUN = function(preds) {
  return(diag(cor(x = preds, y = resp.valid, use = "pairwise.complete.obs")))
}) %>% t()

# compute pc metric
pc.null <- pc.nulldist(resp.valid)
concat.pc <- mapply(concat.preds.ls, FUN = function(preds) {
  return(pcs(preds, resp.valid, pc.null))
}) %>% t()

# compute wpc metric
concat.wpc <- mapply(concat.preds.ls, FUN = function(preds) {
  return(wpc(preds, resp.valid, pc.null))
})

saveRDS(concat.mse, paste0(save.path, "concatenated_mse.rds"))
saveRDS(concat.corr, paste0(save.path, "concatenated_corr.rds"))
saveRDS(concat.pc, paste0(save.path, "concatenated_pc.rds"))
saveRDS(concat.wpc, paste0(save.path, "concatenated_wpc.rds"))
saveRDS(lasso.fits, paste0(save.path, "concatenated_lasso_fits.rds"))
saveRDS(lasso.escv.fits, paste0(save.path, "concatenated_lasso_escv_fits.rds"))
saveRDS(elnet.fits, paste0(save.path, "concatenated_elnet_fits.rds"))
saveRDS(kernel.fits, paste0(save.path, "concatenated_kernel_fits.rds"))
saveRDS(rf.fits, paste0(save.path, "concatenated_rf_fits.rds"))
# for x-vae fit, see xvae/01_xvae.ipynb
# fit BMTMKL
source("./baseline_bmtmkl/bayesian_multitask_multiple_kernel_learning_train.R")
source("./baseline_bmtmkl/bayesian_multitask_multiple_kernel_learning_test.R")
source("./baseline_bmtmkl/wpc_index.R")

# initialize the parameters of the algorithm
parameters <- list()

# set the hyperparameters of the gamma prior used for sample weights
parameters$alpha_lambda <- 1
parameters$beta_lambda <- 1

# set the hyperparameters of the gamma prior used for intermediate noise
parameters$alpha_upsilon <- 1
parameters$beta_upsilon <- 1

# set the hyperparameters of the gamma prior used for bias
parameters$alpha_gamma <- 1
parameters$beta_gamma <- 1

# set the hyperparameters of the gamma prior used for kernel weights
parameters$alpha_omega <- 1
parameters$beta_omega <- 1

# set the hyperparameters of the gamma prior used for output noise
parameters$alpha_epsilon <- 1
parameters$beta_epsilon <- 1

# set the number of iterations
parameters$iteration <- 200

# determine whether to calculate and store the lower bound values
parameters$progress <- 0

# set the seed for the random number generator used to initialize random variables
parameters$seed <- 1606

# set the number of kernels (e.g., the number of views)
P <- 4

#computing Gaussian kernel on ccle data set
Ktrain <- ytrain <- Ktest <- vector(mode = "list", ndrugs)
for(drug in 1:ndrugs){
  
  y <- drug.resp[, drug]
  keep <- !is.na(y)
  y <- y[keep]
  n0 <- length(y) + nrow(all.data.valid[[1]])
  K <- array(dim = c(n0, n0, 4))
  for(p in 1:P){
    K[, , p] <- gausskernel(rbind(all.data[[p]][keep, ], all.data.valid[[p]]), 
                            sigma = ncol(all.data[[p]]))
  }
  Ktrain[[drug]] = K[c(1:length(y)), c(1:length(y)), ]
  ytrain[[drug]] = y
  Ktest[[drug]] = K[c(1:length(y)), -c(1:length(y)), ]
  
}

#fitting model
state <- bayesian_multitask_multiple_kernel_learning_train(Ktrain, ytrain,
                                                           parameters)
prediction <- bayesian_multitask_multiple_kernel_learning_test(Ktest, state)

bmtmkl.pred <- matrix(0, nrow(resp.valid), ndrugs)
for(drug in 1:ndrugs){
  bmtmkl.pred[, drug] = prediction$f[[drug]]$mean
}

# mse
bmtmkl.mses <- colMeans((bmtmkl.pred - resp.valid)^2, na.rm = T)

# correlation
bmtmkl.corrs <- diag(cor(x = bmtmkl.pred, y = resp.valid,
                         use = "pairwise.complete.obs"))

# pc
bmtmkl.pcs <- pcs(bmtmkl.pred, resp.valid, pc.null)

# wpc
bmtmkl.wpc <- wpc(bmtmkl.pred, resp.valid, pc.null)

saveRDS(bmtmkl.mses, paste0(save.path, "bmtmkl_mse.rds"))
saveRDS(bmtmkl.corrs, paste0(save.path, "bmtmkl_corr.rds"))
saveRDS(bmtmkl.pcs, paste0(save.path, "bmtmkl_pc.rds"))
saveRDS(bmtmkl.wpc, paste0(save.path, "bmtmkl_wpc.rds"))
saveRDS(bmtmkl.pred, paste0(save.path, "bmtmkl_preds.rds"))

Table ?? compares the validation accuracies between the integrative and non-integrative methods when aggregating the prediction accuracies across all 24 drugs. We see that in this case, the data integration methods do not improve the prediction accuracy as one might hope.


This finding holds when examining prediction accuracies at the individual drug level as well. In the figures below, we compare the validation MSEs when applying various methods (namely, kernel ridge and RF) to the full concatenated CCLE data versus the RNASeq data alone. For most drugs, fitting on the concatenated data yields slightly worse prediction accuracy than fitting on the RNASeq data alone. Potential reasons for this lack of improvement include the ultra-high dimensionality of the concatenated data set and an increased noise-to-signal ratio from adding a substantial number of extra (possibly noisy) variables.

Validation MSE

(#fig:sub_chunk_120) We train various models to predict the 24 drug responses using the concatenated data set and the RNASeq data set, and we plot the MSE obtained on the validation set.

Validation R2

(#fig:sub_chunk_121) We train various models to predict the 24 drug responses using the concatenated data set and the RNASeq data set, and we plot the R2 obtained on the validation set.

Validation Correlation

(#fig:sub_chunk_122) We train various models to predict the 24 drug responses using the concatenated data set and the RNASeq data set, and we plot the Correlation obtained on the validation set.

Validation PC-index

(#fig:sub_chunk_123) We train various models to predict the 24 drug responses using the concatenated data set and the RNASeq data set, and we plot the PC-index obtained on the validation set.

However, considering that our primary goal is not purely prediction, the differences between model prediction accuracies shown in Table ?? are relatively small from a practical viewpoint. In our inferential procedure discussed next, we will see that leveraging the stability across these methods with similar predictive accuracies is key to our staDRIP pipeline for identifying genes and proteins that are stable predictive features underlying the drug response models.

To conclude our investigation of the prediction performances, we finally report an unbiased estimate of the test prediction error for our model-building pipeline. That is, we use our best model, the RNASeq-trained kernel ridge regression model, to predict the drug responses for cell lines in the test set and report an estimated test MSE (\(\pm\) 1SD) of 0.503 (\(\pm\) 0.039), \(R^2\) of 0.204 (\(\pm\) 0.038), correlation of 0.453 (\(\pm\) 0.041), and WPC-index of 0.620 (\(\pm\) 0.0075) across the 24 drugs. The individual drug-level prediction accuracies for the test set are provided in Table ??.

# refit rnaseq kernel ridge regression on train+valid and evaluate on test
registerDoParallel(cores = 24)
out <- foreach(drug = 1:ndrugs, 
               .packages = c("glmnet", "KRLS", "caret")) %dopar% {
  
  Xtrain <- rbind(as.matrix(all.data[[type]]),
                  as.matrix(all.data.valid[[type]]))
  Xvalid <- as.matrix(all.data.test[[type]])
  
  # remove data with missing responses
  ytr <- c(drug.resp[, drug], resp.valid[, drug])
  missing.idx <- which(is.na(ytr))
  tissue.all <- c(tissue, tissue.valid)
  if (length(missing.idx) >= 1) {
    Xtrain <- Xtrain[-missing.idx, ]
    y <- ytr[-missing.idx]
    tissue.type <- tissue.all[-missing.idx]
  }else {
    y <- ytr
    tissue.type <- tissue.all
  }
  
  # stratified cv folds
  foldid <- data.frame(tissue = tissue.type) %>%
    group_by(tissue) %>%
    mutate(split.idx = sample(rep(1:nfolds, each = ceiling(n() / nfolds)),
                              size = n(), replace = F)) %>%
    pull(split.idx)
  
  # fit kernel
  kernel.fit <- krls(X = Xtrain, y = y, 
                     whichkernel = "gaussian", print.level = 0)

  # make predictions
  kernel.pred <- predict(kernel.fit, newdata = Xvalid)$fit

  list(kernel.fits = kernel.fit,
       kernel.preds = kernel.pred)
}

kernel.fits <- lapply(out, FUN = function(X) {return(X$kernel.fits)})
kernel.preds <- mapply(out, FUN = function(X) {return(X$kernel.preds)})

indiv.preds.ls <- list(Kernel = kernel.preds)

# compute mse
test.mses <- mapply(indiv.preds.ls, FUN = function(preds) {
  return(colMeans((preds - resp.test)^2, na.rm = T))
}) %>% t()

# compute correlation
test.corrs <- mapply(indiv.preds.ls, FUN = function(preds) {
  return(diag(cor(x = preds, y = resp.test, use = "pairwise.complete.obs")))
}) %>% t()

# compute pc metric
pc.null <- pc.nulldist(resp.test)
test.pcs <- mapply(indiv.preds.ls, FUN = function(preds) {
  return(pcs(preds, resp.test, pc.null))
}) %>% t()

# compute wpc metric
test.wpcs <- mapply(indiv.preds.ls, FUN = function(preds) {
  return(wpc(preds, resp.test, pc.null))
})

saveRDS(test.mses, paste0(save.path, "test_kernel_rnaseq_mse.rds"))
saveRDS(test.corrs, paste0(save.path, "test_kernel_rnaseq_corr.rds"))
saveRDS(test.pcs, paste0(save.path, "test_kernel_rnaseq_pc.rds"))
saveRDS(test.wpcs, paste0(save.path, "test_kernel_rnaseq_wpc.rds"))
saveRDS(kernel.fits, paste0(save.path, "test_kernel_rnaseq_fits.rds"))
saveRDS(kernel.preds, paste0(save.path, "test_kernel_rnaseq_preds.rds"))

# compute sd by doing bootstrapping on test set
indiv.preds.ls <- list()
test.ls <- list()
for (b in 1:100) {
  bsamp <- sample(1:nrow(kernel.preds), nrow(kernel.preds), replace = T)
  indiv.preds.ls[[b]] <- kernel.preds[bsamp, ]
  test.ls[[b]] <- resp.test[bsamp, ]
}

# compute point estimate and sd of test mse
test.mse <- mean(test.mses)
test.mse.sd <- mapply(indiv.preds.ls, test.ls, FUN = function(x, y) {
  return(colMeans((x - y)^2, na.rm = T))
}) %>% 
  colMeans() %>%
  sd()

# compute point estimate and sd of test r2
test.vars <- apply(resp.test, 2,
                   FUN = function(x) {
                     n <- sum(!is.na(x))
                     return((n-1) / n * var(x, na.rm = T))
                   })
test.r2 <- mean(1 - test.mses / test.vars)
test.r2.sd <- mapply(indiv.preds.ls, test.ls, FUN = function(x, y) {
  return(1 - colMeans((x - y)^2, na.rm = T) / test.vars)
}) %>% 
  colMeans() %>%
  sd()

# compute point estimate and sd of test correlation
test.corr <- mean(test.corrs)
test.corr.sd <- mapply(indiv.preds.ls, test.ls, FUN = function(x, y) {
  return(diag(cor(x = x, y = y, use = "pairwise.complete.obs")))
}) %>%
  colMeans() %>%
  sd()

# compute point estimate and sd of wpc-index
test.wpc <- test.wpcs
pc.null <- pc.nulldist(resp.test)
test.wpc.sd <- mapply(indiv.preds.ls, test.ls, FUN = function(x, y) {
  return(wpc(x, y, pc.null))
}) %>%
  sd()

6 Stability analysis

set.seed(123567)

# do PCS inference to find important stable genes and proteins
protein = all.data[['prot']]
rnaseq = all.data[['rnaseq']]

get_stab_scores <- function(drug, boot = 100){
    
    protein.escv.model = rep(0, ncol(protein))
    protein.rf.model = rep(0, ncol(protein))
    rnaseq.escv.model = rep(0, ncol(rnaseq))
    rnaseq.rf.model = rep(0, ncol(rnaseq))
    protein.enet.model = rep(0, ncol(protein))
    rnaseq.enet.model = rep(0, ncol(rnaseq))
    n = nrow(rnaseq)
    
    for(i in 1:boot){
        
        idxs = sample.int(n, n, replace = TRUE)
        x.train.rnaseq = as.matrix(rnaseq[idxs,])
        x.train.protein = as.matrix(protein[idxs,])
        y.train = drug.resp[idxs, drug]
        non.na = which(!is.na(y.train))
        x.train.rnaseq = x.train.rnaseq[non.na, ]
        x.train.protein = x.train.protein[non.na, ]
        y.train = y.train[non.na]
        
        # separate models for protein and rnaseq;
        # for a blockwise model, replace y.train with y.train - escv.prediction
        # (or cv.prediction, rf.prediction)
        
        # escv
        protein.escv = escv.glmnet(x.train.protein, 
                                  y.train, 
                                  lambda = seq(0, 0.2, 0.002),
                                  nfolds = 5)
        # rf
        protein.df = data.frame(as.matrix(x.train.protein), y = y.train)
        ra = ranger(y~., data = protein.df, num.trees = 100, importance = 'impurity')
        
        # enet
        protein.enet = cv.glmnet(x.train.protein, y.train, alpha = 0.5, nfolds = 5, nlambda = 20)
        
        
        protein.escv.model = protein.escv.model + (coef(protein.escv$glmnet.fit, s = protein.escv$lambda.escv)[-1] !=0)
        protein.enet.model = protein.enet.model + (coef(protein.enet, s = protein.enet$lambda.min)[-1] !=0)
        #thresh_rf = mean(ra$variable.importance) + 2*sqrt(var(ra$variable.importance))
        protein.rf.model = protein.rf.model + (ra$variable.importance)
        
        escv.prediction = predict(protein.escv$glmnet.fit, x.train.protein, s = protein.escv$lambda.escv)
        cv.prediction = predict(protein.escv$glmnet.fit, x.train.protein, s = protein.escv$lambda.cv)
        rf.prediction = ra$predictions
        
        rnaseq.escv = escv.glmnet(x.train.rnaseq, 
                                  y.train, 
                                  lambda = seq(0, 0.2, 0.002),
                                  nfolds = 5)
        
        rnaseq.df = data.frame(as.matrix(x.train.rnaseq), y = y.train)
        ra = ranger(y~., data = rnaseq.df, num.trees = 100, importance = 'impurity')
        
        rnaseq.enet = cv.glmnet(x.train.rnaseq, y.train, alpha = 0.5, nfolds = 5, nlambda = 20)

        rnaseq.escv.model = rnaseq.escv.model + (coef(rnaseq.escv$glmnet.fit, s = rnaseq.escv$lambda.escv)[-1] !=0)
        rnaseq.enet.model = rnaseq.enet.model + (coef(rnaseq.enet, s = rnaseq.enet$lambda.min)[-1] !=0)
        #thresh_rf = mean(ra$variable.importance) + 2*sqrt(var(ra$variable.importance))
        rnaseq.rf.model = rnaseq.rf.model + (ra$variable.importance) 

    }
    return(list(rnaseq.escv = rnaseq.escv.model,
                rnaseq.rf = rnaseq.rf.model,
                rnaseq.enet = rnaseq.enet.model,
                protein.escv = protein.escv.model,
                protein.rf = protein.rf.model,
                protein.enet = protein.enet.model
                ))
}

scores = mclapply(1:24, get_stab_scores, mc.cores = 6)

saveRDS(scores, paste0(save.path, "stab_scores_separate.rds"))
scores = readRDS(paste0(save.path, "stab_scores_separate.rds"))

stab.features = data.frame(prot = rep(" ", 24), rnaseq = rep(" ", 24), stringsAsFactors = FALSE)
rownames(stab.features) = colnames(drug.resp)
protein = all.data[['prot']]
rnaseq = all.data[['rnaseq']]

info <- read.table("./data/raw/ccle_RNAseq_RPKM_Gene_Info.txt",
                 header = TRUE)
rnaseq.dict = list()
for(i in 1:nrow(info)){
    rnaseq.dict[[as.character(info[i, 1])]] = as.character(info[i, 2])
}

for (drug in 1:24){
    
    score = scores[[drug]]
    protein.escv.top = order(score[['protein.escv']], decreasing = TRUE)[1:10]
    protein.enet.top = order(score[['protein.enet']], decreasing = TRUE)[1:10]
    protein.rf.top = order(score[['protein.rf']], decreasing = TRUE)[1:10]
    rnaseq.escv.top = order(score[['rnaseq.escv']], decreasing = TRUE)[1:10]
    rnaseq.enet.top = order(score[['rnaseq.enet']], decreasing = TRUE)[1:10]
    rnaseq.rf.top = order(score[['rnaseq.rf']], decreasing = TRUE)[1:10]
    protein.index = intersect(intersect(protein.escv.top, protein.rf.top), protein.enet.top)
    rnaseq.index = intersect(intersect(rnaseq.escv.top, rnaseq.enet.top), rnaseq.rf.top)
    rank.sum = sapply(protein.index, 
                      function(pi) {which(protein.escv.top == pi) + 
                                      which(protein.enet.top == pi) +
                                    which(protein.rf.top == pi)})
    rank.sum.rnaseq = sapply(rnaseq.index, 
                      function(pi) {which(rnaseq.escv.top == pi) + 
                                      which(rnaseq.enet.top == pi) +
                                    which(rnaseq.rf.top == pi)})

    protein.features = colnames(protein)[protein.index]
    rnaseq.features = sapply(rnaseq.index, 
                             function(index) rnaseq.dict[[colnames(rnaseq)[index]]])
    stab.features[drug, ] = c(paste(protein.features[order(rank.sum)], collapse = ", "),
                              paste(rnaseq.features[order(rank.sum.rnaseq)], collapse = ", "))
}

myKable(stab.features,
        caption = "Stable protein and RNAseq signatures. A feature is included if it is among the top 10 most stable features under 3 different machine learning models (i.e., Elastic net (CV), Lasso (ESCV), and random forest). The stability of the features are computed from the PCS inference framework. Blank cells indicate that no features appeared among the top 10 most stable features for all three models.")
Table 6.1: Stable protein and RNAseq signatures. A feature is included if it is among the top 10 most stable features under 3 different machine learning models (i.e., Elastic net (CV), Lasso (ESCV), and random forest). The stability of the features are computed from the PCS inference framework. Blank cells indicate that no features appeared among the top 10 most stable features for all three models.
Drug | prot | rnaseq
17-AAG | Bax, p53_Caution, Caspase-7_cleavedD198_Caution, eIF4E | CTD-2033A16.1, AP2S1, BZW2
AEW541 | Akt_pT308, Smad1, p27_pT198, PTEN, RAD51 | B4GALT3, SEMA3B
AZD0530 | p38_pT180_Y182, c-Kit | HPGD
AZD6244 | PI3K-p85, TFRC, Bax | SPRY2, RP11-1143G9.4, LYZ, DUSP6, PRSS57
Erlotinib | EGFR_pY1068_Caution, Beclin_Caution, P-Cadherin_Caution | PIP4K2C, SEC61G
Irinotecan | MDMX_MDM4(BetIHC-00108)_Caution, Src_pY416_Caution | 
L-685458 | YAP_pS127_Caution, VEGFR2, Src_pY416_Caution, Src | 
LBW242 | ASNS | MRPL24
Lapatinib | HER2_pY1248_Caution, HER3_pY1289_Caution, EGFR_pY1068_Caution, Rab25, Heregulin | STARD3
Nilotinib | STAT5-alpha, c-Kit, SHP-2_pY542_Caution, Src_pY416_Caution, p27_pT198 | 
Nutlin-3 | Bcl-2, Bax | 
PD-0325901 | MEK1_pS217_S221, Bax, TFRC, PI3K-p85 | SPRY2, DUSP6, ETV4
PD-0332991 | Bcl-2, MDMX_MDM4(BetIHC-00108)_Caution, Src_pY416_Caution | 
PF2341066 | c-Met_pY1235, c-Met_Caution | CAPZA2
PHA-665752 | MEK1 | FMNL1
PLX4720 | MEK1_pS217_S221, Bax, PREX1, Beclin_Caution | FABP7
Paclitaxel | Src_pY416_Caution, beta-Catenin | ORMDL2
Panobinostat | VEGFR2, Src | 
RAF265 | PI3K-p85, FOXO3a_Caution, eEF2K | RETN
Sorafenib | Bcl-2, Src | 
TAE684 | PTEN, Akt_pT308, p70S6K, Bcl-2 | H1FX
TKI258 | C-Raf_pS338, CD49b | 
Topotecan |  | OSGIN1
ZD-6474 | c-Kit, STAT5-alpha | 

Given that the predictive performance is fairly similar across most of the models discussed above, identifying the features that drive the predicted drug responses and are stable across models may be of even greater interest. Two of the models discussed previously (namely, the Lasso and the random forest) provide measures of feature importance, and we discuss these selected features here.

In our work, we are primarily interested in identifying protein and gene signatures associated with a particular drug’s response, for two reasons: (1) protein and RNA signatures are directly related to the known drug targets, and it is easier to compare the genomic signatures found by our procedure with those previously identified in the biomedical literature; and (2) the protein and RNAseq data sets yield the highest predictive accuracies when used alone.

The details of our pipeline are as follows. First, we randomly draw \(B=100\) bootstrap samples \(\mathcal{D}^{(b)}, b=1, \dots, 100\) from the training data; for the CCLE data, bootstrapping is an appropriate data perturbation. For each bootstrap sample, we consider 3 models, which were shown to have high predictive accuracies in the previous section and hence are reasonable fits for the data: (1) the Lasso with tuning parameter selected by ESCV, (2) the elastic net with tuning parameter selected by CV, and (3) the random forest. Note that although Gaussian kernel ridge regression has the highest average prediction accuracy over the 24 drugs, extracting relevant feature importance measures from a fitted kernel regression model remains an open question.
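As a minimal sketch, the bootstrap perturbation step can be written as below. The objects `x_train` and `y_train` are toy stand-ins for one -omic feature matrix and a single drug's responses, and the commented-out `someModelFit` wrapper is hypothetical; none of these names come from the actual pipeline.

```r
# toy stand-ins for the real training design matrix (cell lines x features)
# and one drug's responses; names here are purely illustrative
set.seed(1)
x_train <- matrix(rnorm(50 * 10), nrow = 50)
y_train <- rnorm(50)

# draw B = 100 bootstrap samples of the cell lines (rows), with replacement
B <- 100
boot_idx <- lapply(seq_len(B), function(b)
  sample(nrow(x_train), replace = TRUE))

# each of the three models would then be refit on every bootstrap sample, e.g.:
# fit_b <- lapply(boot_idx, function(idx)
#   someModelFit(x_train[idx, ], y_train[idx]))  # hypothetical fitting wrapper
```

Resampling rows (cell lines) rather than features keeps each perturbed data set a plausible draw from the same population of cell lines.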

Each of the above three models is fitted to the protein and RNAseq data separately; we fit separate models to the two data sets so that the protein and RNAseq signatures remain separately interpretable, and because of the aforementioned limitations of the alternative concatenation approach. Next, for each feature \(X_j\) from either the protein or the RNAseq data set, define \(\omega_j^{(b)}\) as follows: for the Lasso, \(\omega_j^{(b)}=1\) if the coefficient of \(X_j\) is non-zero and \(\omega_j^{(b)}=0\) otherwise; for the random forest, \(\omega_j^{(b)}\) is the MDI feature importance of \(X_j\). We then define the stability score of each feature \(X_j\) as \(\text{sta}(X_j) = \frac{1}{B}\sum_{b=1}^{B}\omega_j^{(b)}\) and rank the proteins and genes separately by these scores. Several proteins and genes are among the top 10 features, ranked by stability score, for all three models; these stable features are listed in Table 6.1.
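The scoring and intersection steps above can be sketched in a few lines. Here `omega` and the three `*_top` vectors are randomly generated toy stand-ins for the per-bootstrap scores \(\omega_j^{(b)}\) and the per-model top-10 lists; they do not reproduce the actual staDRIP results.

```r
# omega: B x p matrix of per-bootstrap scores omega_j^(b)
# (0/1 selection indicators for the Lasso fits, MDI importances for the RF)
set.seed(2)
B <- 100; p <- 5
omega <- matrix(rbinom(B * p, 1, prob = c(0.9, 0.6, 0.3, 0.1, 0.05)),
                nrow = B, byrow = TRUE)

# stability score: sta(X_j) = (1/B) * sum_b omega_j^(b)
sta <- colMeans(omega)

# rank features by stability score, most stable first
feature_rank <- order(sta, decreasing = TRUE)

# toy top lists for the three models; the reported signature is the
# set of features appearing in all three lists
escv_top <- c(3, 1, 7, 2, 9)
enet_top <- c(1, 2, 8, 3, 4)
rf_top   <- c(2, 3, 5, 1, 6)
stable_features <- Reduce(intersect, list(escv_top, enet_top, rf_top))
```

Because the Lasso scores are indicators, `colMeans` directly gives the selection frequency over bootstrap samples; for the random forest, it gives the average MDI importance instead.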

7 Discussion

Rooted in the PCS framework, we emphasize the importance of predictability, (computability), and stability as minimum requirements for extracting scientific knowledge throughout the staDRIP pipeline. We show that, guided by good prediction performance, incorporating a number of stability checks and extracting the stable parts of top-performing models can help avoid the poor generalization exhibited by existing methods and can successfully identify candidate therapeutic targets for future preclinical research. We also acknowledge that, while many stability considerations are built into staDRIP, human judgment calls inevitably still impact our analysis; this documentation is one effort to ensure transparency and appropriate justification for those decisions.

Bibliography

Barretina, Jordi, Giordano Caponigro, Nicolas Stransky, Kavitha Venkatesan, Adam A Margolin, Sungjoon Kim, Christopher J Wilson, et al. 2012. “The Cancer Cell Line Encyclopedia Enables Predictive Modelling of Anticancer Drug Sensitivity.” Nature 483 (7391). Nature Publishing Group: 603.

Costello, James C, Laura M Heiser, Elisabeth Georgii, Mehmet Gönen, Michael P Menden, Nicholas J Wang, Mukesh Bansal, et al. 2014. “A Community Effort to Assess and Improve Drug Sensitivity Prediction Algorithms.” Nature Biotechnology 32 (12). Nature Publishing Group: 1202.

Jang, In Sock, Elias Chaibub Neto, Justin Guinney, Stephen H Friend, and Adam A Margolin. 2014. “Systematic Assessment of Analytical Methods for Drug Sensitivity Prediction from Cancer Cell Line Data.” In Biocomputing 2014, 63–74. World Scientific.

Kohane, Isaac S. 2015. “Ten Things We Have to Do to Achieve Precision Medicine.” Science 349 (6243). American Association for the Advancement of Science: 37–38.

Lim, Chinghway, and Bin Yu. 2016. “Estimation Stability with Cross-Validation (ESCV).” Journal of Computational and Graphical Statistics 25 (2). Taylor & Francis: 464–92.

Nelson, David L, Albert L Lehninger, and Michael M Cox. 2008. Lehninger Principles of Biochemistry. Macmillan.

Rubin, Mark A. 2015. “Health: Make Precision Medicine Work for Cancer Care.” Nature News 520 (7547): 290.

Simidjievski, Nikola, Cristian Bodnar, Ifrah Tariq, Paul Scherer, Helena Andres-Terre, Zohreh Shams, Mateja Jamnik, et al. 2019. “Variational Autoencoders for Cancer Data Integration: Design Principles and Computational Practice.” BioRxiv. Cold Spring Harbor Laboratory, 719542.